Welcome to the NCSCO OCDS Intro track tutorial. We are Nick Gawron and Livia Popa, we will be working with you through today’s tutorial. We will be using R studio for today’s session.
We will be tackling these objectives:
Mr. Wuf works for Mount Mitchell State Park in Burnsville, NC and was recently asked by his boss to write a report summarizing rainfall and temperature data for 2021. This report will be used to help optimize 2022 event planning (e.g., fall color viewing) for park visitors and maintenance scheduling for park staff. He recently got promoted to the head office of the NC State Climate office! Mr. Wuf’s wife, Mrs. Wuf, recently told him about the State Climate Office of North Carolina’s new nClimgrid and Econet data portals. He agrees with her that it would be a great opportunity to check out these new, free tools. After some preliminary sleuthing around nClimgrid, he discovered there was a Monthly U.S Climate Gridded Dataset (nClimGrid; https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00332). How did he miss this? Once he downloads these data from nClimGrid and Econet, Mr. Wuf plans to put the skills he learned in an online R programming course to the test for this real-world, work-related project.
Suzie
-Cloud storage is a service model in which data is transmitted and stored on remote storage systems, where it is maintained, managed, backed up and made available to users over a network
The NOAA Monthly U.S. Climate Gridded Dataset (NClimGrid) consists of four climate variables derived from the GHCN-D dataset: maximum temperature, minimum temperature, average temperature and precipitation.
Each file provides monthly values in a 5x5 lat/lon grid for the Continental United States. Data is available from 1895 to the present.
On an annual basis, approximately one year of “final” nClimGrid will be submitted to replace the initially supplied “preliminary” data for the same time period.
We can find nClimGrid Data hosted on AWS servers here.
## Read in data from AWS nClimGrid 2020-2021
## From the decadal section with state labels
## Drop missing values
nclim_2020s<-read_csv('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/decadal/2020-2021-ste-decadal.csv')%>%drop_na()
## Rows: 32160 Columns: 10
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): month, state, region_code
## dbl (6): year, day, PRCP, TAVG, TMIN, TMAX
## date (1): date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Read in nClimGrid 2010-2019
## From decadal with state labels
## Use the fread command from data.table
nclim_2010s<-fread('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/decadal/2010-2019-ste-decadal.csv')%>%drop_na()
#If computation time takes a while ... we have a .Rda file
load("NCLIM2010.Rda")
load("NCLIM2020.Rda")
nclim as well as nclimk_rawnclim will be used for time series analyses and nclimk_raw will be used for k means clustering## Lets quickly combine the data sets to get a data set ranging from the above
## lets also filter for data record that pertain to North Carolina
nclim <- rbind(nclim_2020s,nclim_2010s)%>%filter(state == c("North Carolina", "South Carolina"))
## We are looking at values from January for all the US here, we will eventually subset to a certain state!
## Read in the data with county labels, call it nclimk_raw
nclimk_raw<-read_csv('https://noaa-nclimgrid-daily-pds.s3.amazonaws.com/EpiNOAA/csv/202101-cty-scaled.csv')
## Rows: 96317 Columns: 11
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): date, month, state, county, region_code
## dbl (6): year, day, PRCP, TAVG, TMIN, TMAX
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using the formula \((C \times \frac{9}{5}) + 32=F\), covert the temperature to Fahrenheit.
#create a function for basic syntax
# convert data value temperature F to C
# Call it CtoF
CtoF<- function(datcols){
x<- (9/5)*datcols+32
return(x)
}
We now will need to apply this function to the temperature columns.
#apply the function CtoF with tidyverse functions
# to the columns relating to temperature
colnames(nclim)
## [1] "date" "year" "month" "day" "state"
## [6] "region_code" "PRCP" "TAVG" "TMIN" "TMAX"
nclim[8:10]<- nclim[8:10]%>% lapply(CtoF)
head(nclim)
## # A tibble: 6 x 10
## date year month day state region_code PRCP TAVG TMIN TMAX
## <date> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2020-01-01 2020 01 1 North Caroli~ 31 0 45.1 34.4 55.8
## 2 2020-01-03 2020 01 3 North Caroli~ 31 9.65 47.9 38.1 57.7
## 3 2020-01-05 2020 01 5 North Caroli~ 31 4.39 47.9 34.8 61.0
## 4 2020-01-07 2020 01 7 North Caroli~ 31 0.56 44.4 31.2 57.6
## 5 2020-01-09 2020 01 9 North Caroli~ 31 0 43.1 29.7 56.5
## 6 2020-01-11 2020 01 11 North Caroli~ 31 2.2 54.5 44.7 64.3
## Apply Tidyverse piping to easily add degree labels to the end of certain columns for temp
## and to add units for precipitation
## Inspect
colnames(nclim)[8:10]<-colnames(nclim)[8:10]%>%tolower()%>% paste0("_degf")
colnames(nclim)[7]<-colnames(nclim)[7]%>%tolower()%>% paste0("_cm")
head(nclim)
## # A tibble: 6 x 10
## date year month day state region_code prcp_cm tavg_degf tmin_degf
## <date> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2020-01-01 2020 01 1 North Ca~ 31 0 45.1 34.4
## 2 2020-01-03 2020 01 3 North Ca~ 31 9.65 47.9 38.1
## 3 2020-01-05 2020 01 5 North Ca~ 31 4.39 47.9 34.8
## 4 2020-01-07 2020 01 7 North Ca~ 31 0.56 44.4 31.2
## 5 2020-01-09 2020 01 9 North Ca~ 31 0 43.1 29.7
## 6 2020-01-11 2020 01 11 North Ca~ 31 2.2 54.5 44.7
## # ... with 1 more variable: tmax_degf <dbl>
## Keep the weather data columns as well as the date, drop all other columns
## Further Create the variable ifRain: a factor to indicate whether it rained on a certain day
## Call this final data set nclimf
nclimf <- nclim %>% select(date,prcp_cm,tavg_degf,tmin_degf,tmax_degf,state)%>%
mutate(ifRain= as.factor(as.integer(prcp_cm>0)))%>%
select(-state)
head(nclimf)
## # A tibble: 6 x 6
## date prcp_cm tavg_degf tmin_degf tmax_degf ifRain
## <date> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 2020-01-01 0 45.1 34.4 55.8 0
## 2 2020-01-03 9.65 47.9 38.1 57.7 1
## 3 2020-01-05 4.39 47.9 34.8 61.0 1
## 4 2020-01-07 0.56 44.4 31.2 57.6 1
## 5 2020-01-09 0 43.1 29.7 56.5 0
## 6 2020-01-11 2.2 54.5 44.7 64.3 1
Below we do the same cleaning and data preparation that we did for nclim, with the exclusion of the ifNC variable. This is because we will be looking at all of the data
# we consider a certain state
state2consider="NY"
nclimk_last <- nclimk_raw%>%
filter(state==state2consider)%>%
select(-c(state,month,region_code,year,day))
#applying function CtoF
nclimk_last[4:6]<- nclimk_last[4:6]%>%lapply(CtoF)
# nclim_k colnames
colnames(nclimk_last)[4:6]<-colnames(nclimk_last)[4:6]%>%
tolower()%>%
paste0("_degf")
#colnames for percp.
colnames(nclimk_last)[3]<-colnames(nclimk_last)[3]%>%
tolower()%>%
paste0("_cm")
head(nclimk_last)
## # A tibble: 6 x 6
## date county prcp_cm tavg_degf tmin_degf tmax_degf
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01/1/2021 Albany County 0 30.7 24.3 37.1
## 2 01/2/2021 Albany County 13.0 28.0 22.2 33.8
## 3 01/3/2021 Albany County 0.02 35.0 28.8 41.1
## 4 01/4/2021 Albany County 5.6 30.7 29.9 31.4
## 5 01/5/2021 Albany County 0 32.6 29.7 35.4
## 6 01/6/2021 Albany County 0 31.3 28.1 34.4
tail(nclimk_last)
## # A tibble: 6 x 6
## date county prcp_cm tavg_degf tmin_degf tmax_degf
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01/26/2021 Yates County 0 21.5 10.2 32.9
## 2 01/27/2021 Yates County 3.69 29.3 24.7 33.8
## 3 01/28/2021 Yates County 0.56 21.0 15.4 26.6
## 4 01/29/2021 Yates County 0.84 13.6 7.36 19.9
## 5 01/30/2021 Yates County 0.67 12.1 7.59 16.7
## 6 01/31/2021 Yates County 0 13.9 4.50 23.3
# scatterplot of avg tempurature vs time, color the points red
## add labels
## save the plot at plotOTemp object
plotOtemp<-ggplot(nclim,aes(x=date,y=tavg_degf))+geom_point(colour="red")+labs(x = "Date", y = "Average Tempurature (F)",title ="Tempurature Over Time")
plotOtemp
## Using the plotOtemp object create a linear regression for the same data
plotOtemp+geom_smooth(method="lm",se=TRUE,col = "black")+theme_classic()
## `geom_smooth()` using formula 'y ~ x'
A time series is a collection of observations of well-defined data items obtained through repeated measurements over time
Time series analyze a sequence of data points collected over an interval of times
We will now try to fit a time series model to this data rather than a regression to see if the results improve. We will create a time series forecast with ARIMA (Auto Regressive Integrated Moving Average).
First we begin by creating an xts object
## with the variables date and average temperature, create a time series object.
## save as temp_ts
temp_ts<-xts(nclimf$tavg_degf,nclim$date)
# train/validation split
# 70% of data goes to the training data, split by date
train_date <- nclimf$date[round(nrow(temp_ts) *0.7)]
train <- temp_ts[index(temp_ts) <= train_date]
test <- temp_ts[index(temp_ts) > train_date]
# plot the time series object we called temp_ts
plot(temp_ts)
# build the time series model
model_ts <- auto.arima(as.numeric(train), trace=TRUE)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 19955.15
## ARIMA(1,0,0) with non-zero mean : 15768.92
## ARIMA(0,0,1) with non-zero mean : 18415.57
## ARIMA(0,0,0) with zero mean : 26854.35
## ARIMA(2,0,0) with non-zero mean : 15400.08
## ARIMA(3,0,0) with non-zero mean : 15349.15
## ARIMA(4,0,0) with non-zero mean : 15141.83
## ARIMA(5,0,0) with non-zero mean : 15144.75
## ARIMA(4,0,1) with non-zero mean : Inf
## ARIMA(3,0,1) with non-zero mean : Inf
## ARIMA(5,0,1) with non-zero mean : Inf
## ARIMA(4,0,0) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(4,0,0) with non-zero mean : 15153.06
##
## Best model: ARIMA(4,0,0) with non-zero mean
model_ts
## Series: as.numeric(train)
## ARIMA(4,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 mean
## 0.6543 0.3227 -0.3118 0.2889 61.5179
## s.e. 0.0195 0.0227 0.0228 0.0195 2.4412
##
## sigma^2 = 31.12: log likelihood = -7570.51
## AIC=15153.02 AICc=15153.06 BIC=15187.75
#plot the test data
plot(test)
Once the model has been created, we will use that to create a forecasting xts object
Finally we plot the validation (testing) data and layer the forecast on top for our time series model
# plot validation data with forecast using autoplot
forecast <- autoplot(forecast(model_ts, h=length(test)))
forecast
We want to create a function that can do a complicated process in as few steps as possible.
calculate_cluster is the name of the function we are going to create to compute k-means clusters among other information.
## Create a function to take in data and a number of k clusters
## We want to scale input data, perform k means, and output a dataframe with assigned clusters and silhouette score
calculate_cluster <- function(data, k) {
x <- data %>%
na.omit() %>%
scale()
df <- kmeans(x, center = k) %>%
augment(data) %>% # creates column ".cluster" with cluster label
# makes coloumn called silhouette, uses a cluster function called silhouette
mutate(silhouette = cluster::silhouette(as.integer(.cluster), dist(x))[, "sil_width"])
# calculate silhouette score
return(df)
}
# Convert nclimlast to data frame, call it nclimk
# Use plot_ly to create a 3D scatterplot
nclimk<-data.frame(nclimk_last)
##plot to the data of average tempurature w.r.t date
plot_ly(x=nclimk$prcp_cm,y=nclimk$tavg_degf,z=nclimk$tmax_degf)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
# data subset nclimk
# we want to only look at the numeric variables
# omit any missing values
# inspect data
nclimk_tocluster <- nclimk %>%
select(tavg_degf,prcp_cm,tmax_degf) %>%
na.omit()
head(nclimk_tocluster)
## tavg_degf prcp_cm tmax_degf
## 1 30.722 0.00 37.130
## 2 28.004 12.99 33.800
## 3 34.952 0.02 41.072
## 4 30.668 5.60 31.424
## 5 32.558 0.00 35.420
## 6 31.298 0.00 34.448
Now we want to set up a data frame to determine which runs
Below we will use the map function, more help can be found from this link
This works similar to lapply but runs C++ in the background
# map cluster calculations function to range of k values
# make cluster and silhoutte scores for each k
# Inspect data
temp_cluster_data <- tibble(k = 2:12) %>%
mutate(kclust = map(k, ~calculate_cluster(nclimk_tocluster, .x))) %>%
unnest(cols = c(kclust))
head(temp_cluster_data)
## # A tibble: 6 x 6
## k tavg_degf prcp_cm tmax_degf .cluster silhouette
## <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 2 30.7 0 37.1 1 0.582
## 2 2 28.0 13.0 33.8 1 0.211
## 3 2 35.0 0.02 41.1 1 0.570
## 4 2 30.7 5.6 31.4 1 0.361
## 5 2 32.6 0 35.4 1 0.582
## 6 2 31.3 0 34.4 1 0.560
str(temp_cluster_data)
## tibble [21,142 x 6] (S3: tbl_df/tbl/data.frame)
## $ k : int [1:21142] 2 2 2 2 2 2 2 2 2 2 ...
## $ tavg_degf : num [1:21142] 30.7 28 35 30.7 32.6 ...
## $ prcp_cm : num [1:21142] 0 12.99 0.02 5.6 0 ...
## $ tmax_degf : num [1:21142] 37.1 33.8 41.1 31.4 35.4 ...
## $ .cluster : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 1 ...
## $ silhouette: num [1:21142] 0.582 0.211 0.57 0.361 0.582 ...
# calculate average silhoutte score (highest for optimal number of k clusters)
# for each k value
# store these values for each value of k as temp_sil_score_data
temp_sil_score_data <- temp_cluster_data %>%
group_by(k) %>%
summarize(avg_sil_score = mean(silhouette, na.rm = TRUE))
head( temp_sil_score_data)
## # A tibble: 6 x 2
## k avg_sil_score
## <int> <dbl>
## 1 2 0.451
## 2 3 0.519
## 3 4 0.409
## 4 5 0.425
## 5 6 0.364
## 6 7 0.389
# find maximum
temp_optimal_sil_score_data <- temp_sil_score_data %>%
filter(max(avg_sil_score, na.rm = TRUE) == avg_sil_score)
# save optimal k
temp_optimal_k_value <- temp_optimal_sil_score_data$k
temp_optimal_k_value
## [1] 3
#3
# plot the data from temp_sil_score_data and visually inspect to determine the k
# with the largest sil score
# Add a marker point to visually show the max value!
temp_sil_score_data%>%
ggplot(mapping =aes(x=k,y=avg_sil_score))+
geom_line()+
geom_point()+
scale_x_continuous(breaks= pretty_breaks())+
geom_point(data=temp_sil_score_data%>%
filter(max(avg_sil_score)==avg_sil_score),
pch=22,
size=5,
colour="purple")+
labs(x="Value of k",y="Mean of Silhouette Score",title="Silhouette Score for Value of k")+
theme_light()
# save plot as image
ggsave("Silhouette_Plot_teacher.png")
## Saving 7 x 5 in image
# save optimal k cluster data as nclimk_optimal_cluster
nclimk_optimal_cluster <- temp_cluster_data %>%
filter(k == temp_optimal_k_value)
head(nclimk_optimal_cluster)
## # A tibble: 6 x 6
## k tavg_degf prcp_cm tmax_degf .cluster silhouette
## <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 3 30.7 0 37.1 3 0.701
## 2 3 28.0 13.0 33.8 1 0.598
## 3 3 35.0 0.02 41.1 3 0.653
## 4 3 30.7 5.6 31.4 3 0.218
## 5 3 32.6 0 35.4 3 0.700
## 6 3 31.3 0 34.4 3 0.692
tail(nclimk_optimal_cluster)
## # A tibble: 6 x 6
## k tavg_degf prcp_cm tmax_degf .cluster silhouette
## <int> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 3 21.5 0 32.9 3 0.214
## 2 3 29.3 3.69 33.8 3 0.536
## 3 3 21.0 0.56 26.6 2 0.387
## 4 3 13.6 0.84 19.9 2 0.656
## 5 3 12.1 0.67 16.7 2 0.623
## 6 3 13.9 0 23.3 2 0.656
# Combine nclim_optimal_cluster with appropriate county data as well as date
# Update column names
# Call this ClusterWCounty
ClusterWCounty<-cbind(nclimk_optimal_cluster,nclimk$county,nclimk$date)
colnames(ClusterWCounty)[7:8]=c("county","date")
head(ClusterWCounty)
## k tavg_degf prcp_cm tmax_degf .cluster silhouette county date
## 1 3 30.722 0.00 37.130 3 0.7013054 Albany County 01/1/2021
## 2 3 28.004 12.99 33.800 1 0.5982150 Albany County 01/2/2021
## 3 3 34.952 0.02 41.072 3 0.6527655 Albany County 01/3/2021
## 4 3 30.668 5.60 31.424 3 0.2178269 Albany County 01/4/2021
## 5 3 32.558 0.00 35.420 3 0.7001896 Albany County 01/5/2021
## 6 3 31.298 0.00 34.448 3 0.6918454 Albany County 01/6/2021
#function for the mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Create a common cluster vector for the new data frame called countyplot
# Change the name of the county to the appropriate fips code
countyPlot<-ClusterWCounty%>%group_by(county)%>%
summarise(Common_Cluster =mean(as.numeric((.cluster))))%>%
mutate(fips=fips(state=state2consider,county = county))%>%
select(-county)
## if i was to compute the mode
#countyPlot<-ClusterWCounty%>%group_by(county)%>%
# summarise(Common_Cluster =getmode(.cluster))%>%
# mutate(fips=fips(state=state2consider,county = county))%>%
# select(-county)
countyPlot<- data.frame(countyPlot)
head(countyPlot)
## Common_Cluster fips
## 1 2.548387 36001
## 2 2.677419 36003
## 3 2.838710 36005
## 4 2.387097 36007
## 5 2.677419 36009
## 6 2.548387 36011
# Using plot_usmap to plot North Carolina counties with their appropriate clusters
plot_usmap( data = countyPlot,
values = "Common_Cluster", "counties",
include = c(state2consider),
color="black")+
labs(title=paste0(state2consider," Cluster Mapping"))+
#can comment out the below line if we use the mode method.
scale_fill_continuous(low = "#FFCC00", high = "#CC0000", name="Common_Cluster", label=scales::comma)+
theme(legend.position="right")
#3d plot for all of our clusters, same as before but now apply cluster coloring from ClusterWCounty
plot_ly(x=nclimk$prcp_cm,y=nclimk$tavg_degf,z=nclimk$tmax_degf,color=ClusterWCounty$.cluster)
## No trace type specified:
## Based on info supplied, a 'scatter3d' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
cardinal <- read_csv("cardinal_data.csv",
col_types = list(`Average Air Temperature (F)` = col_number(),
`Maximum Air Temperature (F)` = col_number(),
`Minimum Air Temperature (F)` = col_number(),
`Average Experimental Leaf Wetness (mV)` = col_number(),
`Total Precipitation (in)` = col_number(),
`Average Relative Humidity (%)` = col_number(),
`Average Soil Moisture (m3/m3)` = col_number(),
`Average Soil Temperature (F)` = col_number(),
`Average Solar Radiation (W/m2)` = col_number(),
`Average Station Pressure (mb)` = col_number()))
## Warning: One or more parsing issues, see `problems()` for details
cardinal<-drop_na(cardinal)
str(cardinal)
## tibble [729 x 11] (S3: tbl_df/tbl/data.frame)
## $ Date : chr [1:729] "1/1/20" "1/2/20" "1/3/20" "1/4/20" ...
## $ Average Air Temperature (F) : num [1:729] 43.1 44.9 52.8 57.2 42.1 44.1 41.4 42.5 40.4 52 ...
## $ Maximum Air Temperature (F) : num [1:729] 53.6 55.4 64.9 65.1 50.5 58.5 52 57.6 50.5 65.8 ...
## $ Minimum Air Temperature (F) : num [1:729] 35.1 35.2 45.7 42.6 34.9 32 31.3 29.7 31.3 38.1 ...
## $ Average Experimental Leaf Wetness (mV): num [1:729] 266 274 362 373 265 ...
## $ Total Precipitation (in) : num [1:729] 0 0.05 0.95 0.52 0 0 0.07 0 0 0 ...
## $ Average Relative Humidity (%) : num [1:729] 63.8 72 92.1 83.5 57 ...
## $ Average Soil Moisture (m3/m3) : num [1:729] 0.28 0.28 0.29 0.35 0.33 0.31 0.3 0.3 0.3 0.29 ...
## $ Average Soil Temperature (F) : num [1:729] 48.6 47.6 51 54.6 48.3 46.1 44.6 43.3 43.3 46.1 ...
## $ Average Solar Radiation (W/m2) : num [1:729] 134.8 66 31.1 44.9 135.4 ...
## $ Average Station Pressure (mb) : num [1:729] 999 1003 998 993 1005 ...
cardinal$Date<-as.Date(cardinal$Date, tryFormats= c("%m/%d/%y"))
view(cardinal)
#changes col names
colnames(cardinal)=c("date","AvgT","MaxT","MinT","AvgLw","Tprep","AvgHum","AvgSm","AvgSt","AvgSr","AvgStp")
cardinal$IfRain<- (cardinal$Tprep>0)
cardinal$IfRain<-as.factor(as.integer(cardinal$IfRain))
ggplot(cardinal,aes(x=date,y=AvgT))+geom_line()+labs(title="Total Daily Rainfall by Date",y="Average Tempurature (F) ", x= "Date")
EDA is how we can motivate future ML models!
We can use forecasting to extend this trend!
using cardinal data to observe if there is clustering
used for future models
helps us describe higher dimensional data with less
Three general steps:
Remove heavily correlated columns! - Min Temp and Max Temp for a certain day will correlate with one another!
Center Data
Observe:
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(cardinal[,-c(1,12)]))
IfRainVar<- cardinal$IfRain
cardshort <- cardinal%>%select(-c(date,IfRain,Tprep,MinT,MaxT))
cardshort
## # A tibble: 729 x 7
## AvgT AvgLw AvgHum AvgSm AvgSt AvgSr AvgStp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 43.1 266. 63.8 0.28 48.6 135. 999.
## 2 44.9 274. 72.0 0.28 47.6 66.0 1003.
## 3 52.8 362. 92.1 0.29 51 31.1 998.
## 4 57.2 373 83.5 0.35 54.6 44.9 993.
## 5 42.1 265. 57.0 0.33 48.3 135. 1005.
## 6 44.1 265. 57.6 0.31 46.1 138. 1005.
## 7 41.4 274. 75.2 0.3 44.6 40.9 1002.
## 8 42.5 314. 58.9 0.3 43.3 136. 1010.
## 9 40.4 265. 60.2 0.3 43.3 122. 1022.
## 10 52 266. 73.5 0.29 46.1 74.6 1019.
## # ... with 719 more rows
pca_card<- princomp(scale(cardshort,scale=FALSE),cor = FALSE)
plot(pca_card$scores, pch = 16, col =IfRainVar)
legend("topright",c("No Rain","Rain"),pch=16,col=c("black","red"))
Here we can look at how good PCA does at describing changes in data
We see 2 components describes 96% of the data’s variation ! (This is very good)
summary(pca_card)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 93.1434884 45.4290622 18.05966455 6.31468558 5.494486034
## Proportion of Variance 0.7784786 0.1851865 0.02926584 0.00357804 0.002708918
## Cumulative Proportion 0.7784786 0.9636651 0.99293094 0.99650898 0.999217895
## Comp.6 Comp.7
## Standard deviation 2.9520630681 3.804718e-02
## Proportion of Variance 0.0007819752 1.298933e-07
## Cumulative Proportion 0.9999998701 1.000000e+00
screeplot(pca_card, type = "lines")
Any Questions?
Thank you for attending the beginner track tutorial at the Open Climate Data Science Workshop. Please complete the voluntary feedback survey, linked here. The main purposes of this survey is to (1) determine we met specified teaching goals and (2) improve teaching materials for subsequent tutorial sessions.